Provably Efficient Neural GTD for Off-Policy Learning
This paper studies a gradient temporal difference (GTD) algorithm using neural network (NN) function approximators to minimize the mean squared Bellman error (MSBE). For off-policy learning, we show that the minimum MSBE problem can be recast into a min-max optimization involving a pair of over-parameterized primal-dual NNs. The resultant formulation can then be tackled using a neural GTD algorithm. We analyze the convergence of the proposed algorithm with a 2-layer ReLU NN architecture using $m$ neurons and prove that it computes an approximate optimal solution to the minimum MSBE problem as $m \rightarrow \infty$.
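As an illustrative sketch only (the paper's exact architecture, step sizes, and updates are not reproduced here), the primal-dual idea can be mimicked with two 2-layer ReLU networks of width $m$: a primal value network and a dual network that tracks the TD error, updated in a GTD-style saddle-point fashion. All names below (`forward`, `gtd_step`, the step sizes `eta_p`, `eta_d`) are hypothetical, and, as in common NTK-style analyses, only the first-layer weights are trained.

```python
import numpy as np

# Hedged sketch, NOT the paper's algorithm: a primal-dual GTD-style update for
# off-policy evaluation with two 2-layer ReLU networks of width m.
rng = np.random.default_rng(0)

m, d = 64, 4       # hidden width m, state-feature dimension d
gamma = 0.9        # discount factor

def init_net():
    # 2-layer ReLU net f(s) = (1/sqrt(m)) * sum_r b_r * relu(w_r . s);
    # the output weights b_r are fixed random signs (only W is trained).
    return {"W": rng.normal(size=(m, d)), "b": rng.choice([-1.0, 1.0], size=m)}

def forward(net, s):
    h = np.maximum(net["W"] @ s, 0.0)          # ReLU hidden layer
    return net["b"] @ h / np.sqrt(m)

def grad_W(net, s):
    # Gradient of the scalar output w.r.t. the first-layer weights W.
    act = (net["W"] @ s > 0).astype(float)     # ReLU derivative
    return (net["b"] * act)[:, None] * s[None, :] / np.sqrt(m)

primal = init_net()   # value network (minimizes the squared Bellman error)
dual = init_net()     # dual network (tracks the TD error, ascends)

eta_p, eta_d = 0.05, 0.1   # two-timescale step sizes (dual faster)

def gtd_step(s, r, s_next):
    """One saddle-point update on a transition (s, r, s')."""
    delta = r + gamma * forward(primal, s_next) - forward(primal, s)  # TD error
    w = forward(dual, s)
    # Dual ascent: pull the dual output toward the TD error.
    dual["W"] += eta_d * (delta - w) * grad_W(dual, s)
    # Primal descent along the gradient of delta, weighted by the dual output.
    primal["W"] -= eta_p * w * (gamma * grad_W(primal, s_next) - grad_W(primal, s))
    return delta

# Fit a single synthetic transition: the TD error should shrink.
s, s_next = rng.normal(size=d), rng.normal(size=d)
d0 = abs(gtd_step(s, 1.0, s_next))
for _ in range(500):
    d_last = abs(gtd_step(s, 1.0, s_next))
```

The dual step size being larger than the primal one reflects the usual two-timescale structure of GTD methods; on this toy single-transition problem the TD error decays essentially geometrically once the dual network tracks it.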
Review for NeurIPS paper: Provably Efficient Neural GTD for Off-Policy Learning
Weaknesses: The philosophy of establishing convergence guarantees for neural networks in terms of a specific number of neurons is strange, because the number of neurons is a very coarse description of a network, one that can already be matched by nonparametric estimators (e.g., Cho, Youngmin, and Lawrence K. Saul, and numerous follow-up works). Therefore, if the neural network analysis is to refine this approach, it must also specify the *inter-layer* relationships and broader architectural choices to actually be useful to practitioners. As is, I don't see how the $m$ of Lemma 4.1 can be used to inform the choice of a neural architecture in any sharper manner than, e.g., a single-layer RBF network. Also, reformulating Bellman's equations as saddle-point problems has been previously studied: Shapiro, A. (2011).
Meta-review for NeurIPS paper: Provably Efficient Neural GTD for Off-Policy Learning
This paper generated substantial discussion among the reviewers. Reviewer 1's points about the lack of contextualization are well taken by the other reviewers. That said, the meta-reviewer (in consultation with the Senior Area Chair) agrees that the theoretical contribution will be of interest to the NeurIPS community, and the clarity and sharpness of the authors' response suggests the authors are quite capable of revising the paper to more clearly discuss context and articulate their contribution. As such, the meta-reviewer recommends acceptance.